Data Overview: This dataset contains 11 different properties of red wines that may contribute to the overall quality of the wine. At least 3 different wine experts rated the quality of the 1,599 red wines. A rating of 0 corresponds to a very bad wine and a rating of 10 to a very good wine. In this analysis, we will evaluate the different properties and how well they predict the quality of red wine. What factors make a wine better or worse?
Overview of the Entire Dataset:
## Observations: 1,599
## Variables: 12
## $ fixed.acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, ...
## $ volatile.acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660,...
## $ citric.acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06,...
## $ residual.sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2...
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075,...
## $ free.sulfur.dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15...
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, ...
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0...
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30,...
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46,...
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, ...
## $ quality <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5,...
Here is an initial look at the data: the type of value contained in each column and the number of features describing the wines.
Wine Quality Ratings Overview:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
I wanted to get an initial look at the quality feature. From the two charts above we can see that even though the consumers are given the option to rate the wines from 0 to 10, the wines are only actually rated from 3 to 8. The majority of wines are in the 5 to 6 range, meaning they are average.
The quality chart above shows us that the ratings are fairly normally distributed. The mean quality rating is about 5.6 and the median is 6, which is what we saw on the box plot above as well. Here is a better view of how many wines (the count) are actually present for each rating.
Here we grouped the quality ratings into 3 main groups (poor, average, and good) to better evaluate the quality of wines later in our analysis. You can see here that the majority of wines fall in the “average” group.
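A grouping like this can be done with base R's `cut()`. The following is a minimal sketch rather than the code behind this report: the cut points (poor up to 4, average 5 to 6, good 7 and above) are an assumption, since the text does not state them, and the `quality` vector is toy data spanning the observed 3 to 8 range.

```r
# Toy quality scores covering the observed 3-8 range (illustrative only,
# not rows from the wine dataset).
quality <- c(3L, 4L, 5L, 5L, 6L, 6L, 7L, 8L)

# ASSUMPTION: these cut points are hypothetical; the report does not say
# where the poor/average/good boundaries were drawn.
quality.groups <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),            # intervals (-Inf,4], (4,6], (6,Inf)
  labels = c("poor", "average", "good"),
  ordered_result = TRUE                    # keep poor < average < good
)

# Count how many wines land in each group.
group.counts <- table(quality.groups)
print(group.counts)
```

Making the factor ordered keeps the groups in poor, average, good order on plot axes and in grouped summaries.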
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the above plots, we can see that not all of the variables have a normal distribution. To help with this, I would like to log transform at least 2 of the variables: total sulfur dioxide and sulphates.
Here we can see that by applying a log transformation to the sulphates data, the distribution begins to look a lot more like a normal distribution.
Here we can see that by applying a log transformation to the total sulfur dioxide data, the distribution begins to look a lot more like a normal distribution.
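The two transformed columns can be sketched as follows. This is a hedged illustration, not the report's own code: the data frame holds only the first seven values shown for each column in the dataset overview, and the column names `total.sulfur.dioxide.log10` and `sulphates.log10` follow the names that appear in the correlation output.

```r
# First seven values of each column as printed in the dataset overview;
# the full dataset has 1,599 rows.
wines <- data.frame(
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59),
  sulphates            = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46)
)

# log10() is only defined for positive values, which holds for both columns.
wines$total.sulfur.dioxide.log10 <- log10(wines$total.sulfur.dioxide)
wines$sulphates.log10            <- log10(wines$sulphates)

# Sanity check against the grouped summaries in this report: the smallest
# raw sulphates value in the poor group is 0.33, and its log10 should
# match the reported transformed minimum of -0.4815.
poor.min.log10 <- round(log10(0.33), 4)
print(poor.min.log10)
```

A log transform compresses the long right tail because large values are pulled in more strongly than small ones, which is why the transformed histograms look more symmetric.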
Now we can see the distributions look much better after transforming those two variables. Residual sugar and chlorides still have long tails, but we will leave those as they are for now. The rest of the features now have distributions that somewhat resemble a normal distribution. Now that we have analyzed the features individually, I wonder how each of these features affects the quality of wine?
There are 1,599 rows in this dataset, which means 1,599 different wines were compared. There are 12 different columns that describe each of those wines. Quality is of major interest to us during this analysis because it is a rated score and tells us more about what makes up a good wine vs. a bad wine, that is, what a consumer is more likely to purchase based on preference.
I believe the features most likely to support my investigation are residual sugar, alcohol, citric acid and pH. This is because I think these features would most affect the overall taste of the wine, causing a consumer to like or dislike the wine.
I created one new variable called quality groups. This is to be able to better distinguish which wines are good or bad throughout the investigation process. I also created 2 variables by applying a log10 transformation to existing variables, because the transformed values had a more normal distribution for analysis. These variables were total.sulfur.dioxide.log10 and sulphates.log10.
It is interesting that free sulfur dioxide showed multiple peaks, so we will have to investigate that variable further.
This matrix helps to give insights into how the features affect one another. It shows how strong a relationship they have based on the color scale on the side of the matrix. Looking at these correlations, we want to further analyze alcohol, volatile acidity, citric acid, fixed acidity, pH and total sulfur dioxide.
## [,1]
## fixed.acidity 0.11408367
## volatile.acidity -0.38064651
## citric.acid 0.21348091
## residual.sugar 0.03204817
## chlorides -0.18992234
## free.sulfur.dioxide -0.05690065
## total.sulfur.dioxide.log10 -0.19673508
## density -0.17707407
## pH -0.04367193
## sulphates.log10 0.37706020
## alcohol 0.47853169
I also wanted to get a more numeric view of the strength of each variable’s relationship with quality. Here I chose to do a Spearman correlation analysis because it is rank-based and does not assume that the data are normally distributed, and since some of the variables still have long tails, I wanted to play it safe. That being said, if we take a look at the results from the correlation analysis we can see that alcohol content, volatile acidity, sulphates, and citric acid have the strongest relationships with the quality of a wine. So let’s do some more plotting to dig into these 4 variables.
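One consequence of using a rank-based measure is worth illustrating: Spearman's coefficient depends only on the ordering of the values, so a monotonic transform such as log10 leaves it unchanged. The sketch below uses the first seven `sulphates` and `quality` values shown in the dataset overview; it is an illustration of the property, not a reproduction of the table above.

```r
# First seven values of each column as printed in the dataset overview.
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46)
quality   <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L)

# Spearman correlation on the raw and on the log10-transformed values.
rho.raw <- cor(sulphates, quality, method = "spearman")
rho.log <- cor(log10(sulphates), quality, method = "spearman")

# log10 is strictly increasing, so the ranks (including ties) are the
# same and the two coefficients agree.
same.rho <- isTRUE(all.equal(rho.raw, rho.log))
print(same.rho)
```

This means the sulphates.log10 and total.sulfur.dioxide.log10 rows in the table would be the same if the raw columns had been used; the transformation matters for Pearson correlations and for the plots, not for Spearman.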
## wines$quality.groups: poor
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.60 10.00 10.22 11.00 13.10
## --------------------------------------------------------
## wines$quality.groups: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.00 10.25 10.90 14.90
## --------------------------------------------------------
## wines$quality.groups: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.60 11.52 12.20 14.00
Alcohol content seems to have a large impact on a consumer’s rating for the quality of the wine. Here we can see that a good wine’s mean alcohol content is much higher than that of a poor or average wine. The median rises from 10.0% for poor and average wines to 11.6% for good wines, and the mean rises from about 10.2% to 11.5%. This is a large shift. We now know that consumers tend to prefer wines with higher alcohol content. Just out of curiosity, I’m going to look at how the residual sugar content compares to the alcohol content, since the alcohol comes from the fermented sugars.
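Grouped summaries with the layout shown above (a `wines$quality.groups:` header per group) are what base R's `by()` prints when combined with `summary()`. The sketch below assumes that is how they were produced; the data frame is toy data with two wines per group, not rows from the dataset.

```r
# Toy data: two illustrative wines per quality group.
wines <- data.frame(
  alcohol = c(9.0, 10.0, 9.5, 10.5, 11.0, 12.0),
  quality.groups = factor(
    c("poor", "poor", "average", "average", "good", "good"),
    levels = c("poor", "average", "good")
  )
)

# Five-number summary plus mean of alcohol within each group.
alcohol.by.group <- by(wines$alcohol, wines$quality.groups, summary)
print(alcohol.by.group)

# The same comparison reduced to one number per group.
group.medians <- tapply(wines$alcohol, wines$quality.groups, median)
print(group.medians)
```

Comparing medians alongside means is useful here because box plots draw the median, so the two views can be checked against each other.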
There’s not much of a relationship here. I expected more of one: if a wine has a higher alcohol content, I figured there would be less residual sugar, since more of the sugar would have been converted into alcohol. It seems that’s not how it works, and the residual sugar stays fairly constant regardless of alcohol content. The next feature we will dive into is volatile acidity.
## wines$quality.groups: poor
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.5650 0.6800 0.7242 0.8825 1.5800
## --------------------------------------------------------
## wines$quality.groups: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.4100 0.5400 0.5386 0.6400 1.3300
## --------------------------------------------------------
## wines$quality.groups: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4055 0.4900 0.9150
Volatile acidity decreases on average as the quality of the wine increases. I found out online that volatile acidity accounts for the gaseous parts of the wine that can be experienced mostly through the smell of the wine. If there is a higher level of volatile acids present, it can also sometimes make the wine taste a bit like vinegar. So it makes sense that less volatile acidity would be desired and that the good wines show lower levels of these acids. I’m going to take a look at volatile acidity vs. alcohol now to see if they have a relationship. If the volatile acidity affects things like taste and smell, it might also affect the fermentation process and the amount of alcohol that ends up in the final wine.
The volatile acidity vs. alcohol content doesn’t seem to be too telling. Let’s look and see if volatile acidity has an effect on pH level.
This chart shows that volatile acidity and pH have no real pattern or relation to one another that we can decipher. It may look like volatile acidity increases slightly as pH level increases, but it’s not a strong enough relation to investigate more. Now, let’s dig into the next feature with the strongest relationship to quality which is sulphates.
## wines$quality.groups: poor
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4950 0.5600 0.5922 0.6000 2.0000
## --------------------------------------------------------
## wines$quality.groups: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3700 0.5400 0.6100 0.6473 0.7000 1.9800
## --------------------------------------------------------
## wines$quality.groups: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7435 0.8200 1.3600
The above box plot is the relation of sulphates to quality taking the raw data. If you remember though, we log transformed sulphates so we are going to make one with the transformed data as well.
## wines$quality.groups: poor
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.4815 -0.3054 -0.2518 -0.2456 -0.2218 0.3010
## --------------------------------------------------------
## wines$quality.groups: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.4318 -0.2676 -0.2147 -0.2004 -0.1549 0.2967
## --------------------------------------------------------
## wines$quality.groups: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.40894 -0.18709 -0.13077 -0.13572 -0.08619 0.13354
This box plot looks much better. There are fewer outliers and we can distinguish a clearer relationship. As the sulphates present in the wine increase, it seems the overall quality of the wine improves. I read online that many people are turned off by sulphates because they think sulphates give them headaches and because they are an added chemical in the wine. Although this may be true, sulphates play a key role in controlling the fermentation and balance of the wine before and after the winemaking process. This is why it’s important for producers to add this chemical. You wouldn’t want to open up a funky wine when you got home from the liquor store. Since I learned that sulphates affect a lot of the same things as volatile acidity does, let’s see if these two features have some sort of relationship.
Here I was just curious whether the sulphates would affect the acidity level of the wine, since they play a big part in the fermentation process. Here we can see that as the volatile acidity level increases, the sulphates decrease. This is interesting because a lower volatile acidity is desired, so it follows the story. Now we will move on to the next feature compared to quality, which is citric acid.
## wines$quality.groups: poor
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0200 0.0800 0.1737 0.2700 1.0000
## --------------------------------------------------------
## wines$quality.groups: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2400 0.2583 0.4000 0.7900
## --------------------------------------------------------
## wines$quality.groups: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3000 0.4000 0.3765 0.4900 0.7600
It seems that as the citric acid level increases, the quality of wine also increases. So a higher level of citric acid is desired, while a lower level of volatile acidity is desired. Let’s see how they relate in a scatter plot.
For the most part, as citric acid increases there’s a trend showing volatile acidity decreasing so the relation holds true. I’m also curious to see how citric acid would affect the pH level of the wine since it’s desired in higher amounts.
This is not the result I was expecting. I was thinking that the more citric acid present in the wine, the higher the pH level of the wine would be. But it seems to follow the opposite relationship. In hindsight this is consistent with chemistry: more acid means a lower pH.
I found that as volatile acidity increases, the sulphates present in the wine tend to decrease. This could be because sulphates try to regulate and control the fermentation and balance of the wine and if it’s a more volatile wine, it doesn’t have as much of that regulating agent added. Whatever the case, it was an interesting find. I also found that as the citric acid in wine increases, the pH level of the wines decrease.
The strongest relationship I found was between the alcohol content present in wine and the quality of the wine.
The first couple of charts plot alcohol content against a couple of different features. This is because we found that alcohol content played a major role in consumers’ ratings of the overall wine quality.
Here is the plot with the raw, untransformed sulphates data.
Here is the plot once the data has been log transformed. We can see the better quality wines have both higher alcohol content and a higher amount of sulphates present. This follows what we have been finding in previous plots, but provides a better visual of the overall relationship. We already knew each of these was desired when looking at the two features separately, and plotting them against one another shows that the trend holds when they are considered together. Next, we are going to take a look at alcohol vs. citric acid (another feature with a strong relationship to quality).
This doesn’t really give us a lot of information. Let’s continue exploring against the next feature, volatile acidity.
Higher alcohol content and less volatile acidity rates the best here, which follows the story we have been building throughout this investigation. Next, let’s take a look at volatile acidity vs. sulphates, since we have finished looking at all the main features plotted against alcohol.
More sulphates with less volatile acidity rates best here, which again is what we were expecting. It’s a better view with the coloring because not only can you see the trendline, but also where the best wines end up for these two variables. At this point we can say with some confidence that less volatile acidity, more sulphates and a higher alcohol content are preferred by consumers. We should also consider plotting volatile acidity vs. citric acid to see how the two different acids present in the wine relate to each other and how that affects the overall rating of the wine.
More citric acid with less volatile acidity rates best here. This makes sense since we know that citric acid leaves the wine tasting crisp and fresh while the volatile acids can leave a funky taste in the wine. Lastly, I think it would be important to plot the last two features with strong relations to quality against one another: citric acid vs. sulphates (the log transformed data).
Wines with more of both sulphates and citric acid rate the best. Since sulphates vs. citric acid content seems to provide what appears to be the strongest trend, I will do a correlation analysis of the 2 of them.
##
## Pearson's product-moment correlation
##
## data: citric.acid and log10(sulphates)
## t = 14.042, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2871618 0.3744520
## sample estimates:
## cor
## 0.3315162
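The test statistic and confidence interval in this output can be rechecked by hand from the reported correlation and degrees of freedom. The sketch below uses the standard formulas, t = r·sqrt(df / (1 − r²)) and a Fisher z interval; these are the textbook calculations for a Pearson correlation test, and the variable names are my own. The `data:` line suggests the call was along the lines of `with(wines, cor.test(citric.acid, log10(sulphates)))`, but that is an assumption.

```r
# Values copied from the cor.test() output above.
r  <- 0.3315162
df <- 1597
n  <- df + 2          # a correlation test has n - 2 degrees of freedom

# t statistic for H0: true correlation equals 0.
t.stat <- r * sqrt(df / (1 - r^2))

# 95% confidence interval via the Fisher z transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t.stat, 3))
print(round(ci, 4))
```

The recomputed values agree with the printed t of 14.042 and the interval of roughly 0.287 to 0.374, so the very small p-value reflects the large sample size as much as the size of the correlation itself.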
This multivariate analysis was used to compare the main feature, quality, against the features most strongly correlated with it. It strengthened the deductions we had already made in the previous section.
Citric acid vs. log10 transformed sulphates turned out to have a clear positive relationship, although with a Pearson correlation of about 0.33 it is moderate rather than strong. A higher amount of citric acid and sulphates seems to rate really well among customers. This is important because winemakers can take this into account when they are thinking about their end product and how they want their fermenting process to turn out.
This shows the distribution of how many wines fall in each of the different rating categories. We can see it is roughly a normal distribution, with the majority of wines falling in the middle range around the 5 or 6 rating. During the rating process, consumers were given the option to rate wines from 0 to 10, although we only end up with wines being rated from 3 to 8.
## wines$quality.groups: poor
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.60 10.00 10.22 11.00 13.10
## --------------------------------------------------------
## wines$quality.groups: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.00 10.25 10.90 14.90
## --------------------------------------------------------
## wines$quality.groups: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.60 11.52 12.20 14.00
I chose this plot because there is a clear relationship between the quality of wine and the alcohol content in the wine. There is a major jump for the last box plot, which is extremely telling. I used this finding to guide the following investigation since it showed the biggest jump of all the features.
I thought this plot was important because these are the two features that showed the strongest relationship to the quality of a wine. Therefore, we can learn a lot by plotting them against one another. This tells me that it’s crucial for a wine to have low volatile acidity while maintaining an adequate amount of alcohol content in order for the wine’s quality rating to be average or good.
This dataset came from a study conducted around 2009. It contained data on 1,599 different wines, with 12 variables for each wine (11 properties plus the quality rating). We wanted to know how these features ultimately affected the quality ratings as perceived by consumers. To do this, we first looked at the distribution of quality and broke it down into 3 main groups: poor, average, and good. This gave us an easier, clearer way of viewing the ratings on charts. Next, we dove into understanding each of the features on its own. We analyzed their distributions and made any transformations necessary to conduct the rest of our investigation. After that, we started to analyze the effect the features had on one another. I plotted the features with the strongest relationships in order to get a better understanding of how those relationships functioned.
The four features with the strongest correlation/relationship to the quality of the wine were alcohol content, volatile acidity, sulphates, and citric acid. It turns out consumers preferred wines that had a higher alcohol content and ranked those wines much better than those with a low content. Volatile acidity was not wanted in the wines. Consumers rated wines better when the volatile acidity was low. This makes sense since too much volatile acidity can alter the taste of the wine. As for sulphates, the consumers seemed to rate a wine better if the sulphate content was a bit higher. Consumers also rated wines better if there was a higher amount of citric acid in the wine. This also makes sense since the citric acid present gives the wine a crisp, fresh taste.
Other interesting relationships became apparent as well. Volatile acidity and sulphates shared a relationship with one another: as the sulphate content increased, the volatile acidity tended to decrease. Also, we found that as the citric acid content of a wine increased, the pH level decreased. This was the opposite of what I was expecting, but I found out online that the pH level can drastically affect the end taste of the wine and that in general a lower pH level is desired according to the website. Knowing that, it makes sense that a higher citric acid content and a lower pH level, both of which are desired, would go together.
It was sometimes hard to comprehend all of my findings since the file with all the individual plots became very long and I needed to do a lot of scrolling in order to go back and look at something again. It took a while for me to get used to. I also struggled with figuring out a way to display the charts side by side. Thankfully, I found a way to do this via some research on the internet. What went well was my ability to take the plots we learned how to use and actually apply them to this dataset. I was surprised by the quickness of plotting in R as compared to Python, which can be cumbersome at times.
In the future, I would love a winemaker to take the findings of this investigation and aim to make an “optimal” wine.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.